Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

134 ◾ Bioinformatics

4.2.2.2.7 Mark the duplicate reads

GATK4 best practice recommends removing or marking the duplicate reads mapped to the

reference genome to avoid biases in variant calling. In the above examples, we used sam-

tools to mark the duplicate reads. With GATK4 best practice, we can use Picard to mark

the duplicate. Picard is a set of command-line tools for manipulating sequencing data such

as SAM/BAM and VCF. To install Picard, follow the instructions at “https://broadinsti-

tute.github.io/picard/”.

The Picard MarkDuplicates function works by comparing reads in the 5′ positions and

collecting the duplicates. The function identifies the primary and duplicate reads. The

duplicates are marked with the hexadecimal value of 0x0400. For more details, check out

Picard manual.

The following script creates a directory called “dedup”, marks the duplicate alignments

in BAM files and indexes the resulted BAM files, and stores them in the new directory:

mkdir dedup

cd chr21

for i in $(ls *.bam|rev|cut -c 5-|rev);

java -Xmx7g -jar ~/software/picard.jar MarkDuplicates \

-INPUT ${i}.bam \

-OUTPUT ../dedup/${i}.dedup.bam \

-METRICS_FILE ../dedup/${i}.dedup_metrics.txt

samtools index ../dedup/${i}.dedup.bam

done

cd ..

4.2.2.2.8 Adding read grouping header to each BAM file

RGs are identified in the BAM file by a number of tags that are defined according to a specific

format. We can use the following samtools command to display the RG tags on a BAM file:

samtools view -H filename.bam | grep “^@RG”

GATK4 RGs’ tags allow us to differentiate samples and technical features that are asso-

ciated with artifacts. Such information is used to reduce the effects of artifacts during the

duplicate marking and base recalibration steps. GATK requires several RG fields to be pres-

ent in input files and will fail with errors if this requirement is not met. The RG fields are

usually set when the reads are aligned to a reference genome by the aligner. However, if the

BAM files do not include the RG fields, we can use the “AddOrReplaceReadGroups” Picard

function to add or modify the RG fields as we will do next. AddOrReplaceReadGroups uses

the following arguments to add or modify the RG fields on a BAM file:

RGID: This adds group read ID, which must be unique across the samples.

RGPU: This argument is optional for GATK, and it holds the flow cell barcode, lane ID,

and sample barcode separated by a period “.”.

RGSM: This holds the sample name that will be used later as a header of sample column

in the VCF file.